Abstract

Data-efficient learning in continuous state-action spaces using very
high-dimensional observations remains a key challenge in developing fully
autonomous systems. In this paper, we consider one instance of this
challenge, the pixels to torques problem, where an agent must learn a
closed-loop control policy from pixel information only. We introduce a
data-efficient, model-based reinforcement learning algorithm that learns
such a closed-loop policy directly from pixel information. The key
ingredient is a deep dynamical model that uses deep auto-encoders to learn
a low-dimensional embedding of images jointly with a predictive model in
this low-dimensional feature space. Joint learning ensures that not only
static but also dynamic properties of the data are accounted for. This is
crucial for long-term predictions, which lie at the core of the adaptive
model predictive control strategy that we use for closed-loop control.
Compared to state-of-the-art reinforcement learning methods for continuous
states and actions, our approach learns quickly, scales to high-dimensional
state spaces and is an important step toward fully autonomous learning from
pixels to torques.


1. Introduction 

The vision of fully autonomous and intelligent systems that learn by
themselves has influenced AI and robotics research for many decades. To
devise fully autonomous systems, it is necessary to (1) process perceptual
data (e.g., images) to summarize knowledge about the surrounding
environment and the system’s behavior in this environment, (2) make
decisions based on uncertain and incomplete information, (3) take new
information into account for learning and adaptation. Effectively, any
fully autonomous system has to close this perception-action-learning loop
without relying on specific human expert knowledge. The pixels to torques
problem (Brock, 2011) identifies key aspects of an autonomous system:
autonomous thinking and decision making using sensor measurements only,
intelligent exploration and learning from mistakes.

We consider the problem of learning closed-loop policies (“torques”) from
pixel information end-to-end. A possible scenario is a scene in which a
robot is moving about. The only available sensor information is provided by
a camera, i.e., no direct information of the robot’s joint configuration is
available. The objective is to learn a continuous-valued policy that allows
the robotic agent to solve a task in this continuous environment in a
data-efficient way, i.e., we want to keep the number of trials small. To
date, there is no fully autonomous system that convincingly closes the
perception-action-learning loop and solves the pixels to torques problem in
continuous state-action spaces, the natural domains in robotics.


A promising approach toward solving the pixels to torques problem is
Reinforcement Learning (RL) (Sutton & Barto, 1998), a principled
mathematical framework that deals with fully autonomous learning from trial
and error. However, one practical shortcoming of many existing RL
algorithms is that they require many trials to learn good policies, which
is prohibitive when working with real-world mechanical plants or robots.

One way of using data efficiently (and therefore keeping the number of
experiments small) is to learn forward models of the underlying dynamical
system, which are then used for internal simulations and policy learning.
These ideas have been successfully applied to RL, control and robotics in
(Schmidhuber, 1990; Atkeson & Schaal, 1997; Bagnell & Schneider, 2001;
Contardo et al., 2013; Pan & Theodorou, 2014; Deisenroth et al., 2015;
van Hoof et al., 2015; Levine et al., 2015), for instance. However, these
methods use heuristic or engineered low-dimensional features, and they do
not easily scale to data-efficient RL using pixel information only because
even “small” images possess thousands of dimensions.

Figure 1. Illustration of our idea of combining deep learning architectures
for feature learning and prediction models in feature space. A camera
observes a robot approaching an object. A good low-dimensional feature
representation of an image is important for learning a predictive model if
the camera is the only sensor available.


A common way of dealing with high-dimensional data is to learn
low-dimensional feature representations. Deep learning architectures, such
as deep neural networks (Hinton & Salakhutdinov, 2006), stacked
auto-encoders (Bengio et al., 2007; Vincent et al., 2008), or convolutional
neural networks (LeCun et al., 1998), are the current state of the art in
learning parsimonious representations of high-dimensional data. Deep
learning has been successfully applied to image, text and speech data in
commercial products, e.g., by Google, Amazon and Facebook.


Deep learning has been used to produce first promising results in the
context of model-free RL on images. For instance, (Mnih et al., 2015)
present an approach based on Deep-Q-learning, in which human-level game
strategies are learned autonomously, purely based on pixel information.
Moreover, (Lange et al., 2012) presented an approach that learns good
discrete actions to control a slot car based on raw images, employing deep
architectures for finding compact low-dimensional representations. Other
examples of deep learning in the context of RL on image data include
(Cuccu et al., 2011; Koutník et al., 2013). These approaches have in common
that they try to estimate the value function from which the policy is
derived. However, none of these algorithms learns a predictive model, and
they are therefore prone to data inefficiency, either requiring data
collection from millions of experiments or relying on discretization and
very low-dimensional feature spaces, limiting their applicability to
mechanical systems.


To increase data efficiency, we therefore introduce a model-based approach
to learning from pixels to torques. In particular, we exploit results from
(Wahlström et al., 2015) and jointly learn a lower-dimensional embedding of
images and a transition function in this lower-dimensional space that we
can use for internal simulation of the dynamical system. For this purpose,
we employ deep auto-encoders for the lower-dimensional embedding and a
multi-layer feed-forward neural network for the transition function. We use
this deep dynamical model to predict trajectories and apply an adaptive
model-predictive-control (MPC) algorithm (Mayne, 2014) for online
closed-loop control, which is practically based on pixel information only.

MPC has been well explored in the control community. However, adaptive MPC
has so far not received much attention in the literature (Mayne, 2014). An
exception is (Sha, 2008), where the authors advocate a neural network
approach similar to ours. However, they do not consider high-dimensional
data but assume that they have direct access to low-dimensional
measurements.

Our approach benefits from the application of model-based optimal control
principles within a machine learning framework. Along these lines,
(Deisenroth et al., 2009; Abramova et al., 2012; Boedecker et al., 2014;
Pan & Theodorou, 2014; Levine et al., 2015) suggested to first learn a
transition model and then use optimal control methods to solve RL problems.
Unlike these methods, our approach does not need to estimate value
functions and scales to high-dimensional problems.


Similar to our approach, (Boots et al., 2014; Levine et al., 2015; van Hoof
et al., 2015) recently proposed model-based RL methods that learn policies
directly from visual information. Unlike these methods, we exploit a
low-dimensional feature representation that allows for fast predictions and
online control learning via MPC.


Problem Set-up and Objective

We consider a classical $N$-step finite-horizon RL setting in which an
agent attempts to solve a particular task by trial and error. In
particular, our objective is to find a closed-loop policy $\pi^*$ that
minimizes the long-term cost $V^\pi = \sum_{t=0}^{N-1} f_0(x_t, u_t)$,
where $f_0$ denotes an immediate cost, $x_t$ is the continuous-valued
system state and $u_t$ are continuous control inputs.

[Figure 2 diagram: input layer (high-dim. data), hidden layer (feature),
output layer (reconstructed).]

Figure 2. Auto-encoder that consists of an encoder $g^{-1}$ and a decoder
$g$. The encoder maps the original image $y_t \in \mathbb{R}^M$ onto its
low-dimensional representation $z_t = g^{-1}(y_t) \in \mathbb{R}^m$, where
$m \ll M$; the decoder maps this feature back to a high-dimensional
representation $\hat{y}_t = g(z_t)$. The gray color represents
high-dimensional observations.

Figure 3. Prediction model: Each feature $z_i$ is computed from
high-dimensional data $y_i$ via the encoder $g^{-1}$. The transition model
predicts the feature $\hat{z}_{t+1|h_n}$ at the next time step based on the
$n$-step history of $n$ past features $z_{t-n+1}, \dots, z_t$ and control
inputs $u_{t-n+1}, \dots, u_t$. The predicted feature $\hat{z}_{t+1|h_n}$
can be mapped to a high-dimensional prediction $\hat{y}_{t+1}$ via the
decoder $g$. The gray color represents high-dimensional observations.

The learning agent faces the following additional challenges: (a) The agent
has no access to the true state, but perceives the environment only through
high-dimensional pixel information (images), (b) a good control policy is
required in only a few trials. This setting is practically relevant, e.g.,
when the agent is a robot that is monitored by a video camera based on
which the robot has to learn to solve tasks fully autonomously. Therefore,
this setting is an instance of the pixels to torques problem.


2. Deep Dynamical Model

Our approach to solve the pixels-to-torques problem is based on a deep
dynamical model (DDM), which jointly (i) embeds high-dimensional images in
a low-dimensional feature space via deep auto-encoders and (ii) learns a
predictive forward model in this feature space (Wahlström et al., 2015). In
particular, we consider a DDM with control inputs $u$ and high-dimensional
observations $y$. We assume that the relevant properties of $y$ can be
compactly represented by a feature variable $z$. The two components of the
DDM, i.e., the low-dimensional embedding and the prediction model, which
predicts future observations $y_{t+1}$ based on past observations and
control inputs, are detailed in the following. Throughout this paper, $y_t$
denotes the high-dimensional measurements, $z_t$ the corresponding
low-dimensional encoded features and $\hat{y}_t$ the reconstructed
high-dimensional measurement. Further, $\hat{z}_{t+1}$ and $\hat{y}_{t+1}$
denote a predicted feature and measurement at time $t+1$, respectively.

2.1. Deep Auto-Encoder

We use a deep auto-encoder for embedding images in a low-dimensional
feature space, where both the encoder $g^{-1}$ and the decoder $g$ are
modeled with deep neural networks. Each layer $k$ of the encoder neural
network $g^{-1}$ computes $y_t^{(k+1)} = \sigma(A_k y_t^{(k)} + b_k)$,
where $\sigma$ is a sigmoidal activation function (we used arctan) and
$A_k$ and $b_k$ are free parameters. The input to the first layer is the
image, i.e., $y_t^{(1)} = y_t$. The last layer is the low-dimensional
feature representation of the image $z_t(\theta_E) = g^{-1}(y_t; \theta_E)$,
where $\theta_E = [\dots, A_k, b_k, \dots]$ are the parameters of all
neural network layers. The decoder $g$ consists of the same number of
layers in reverse order, see Fig. 2, and approximately inverts the encoder
$g^{-1}$, such that
$\hat{y}_t(\theta_E, \theta_D) = g(g^{-1}(y_t; \theta_E); \theta_D) \approx y_t$
is the reconstructed version of $y_t$ with an associated reconstruction
error

    $\varepsilon^R_t(\theta_E, \theta_D) = y_t - \hat{y}_t(\theta_E, \theta_D).$    (1)

The main purpose of the deep auto-encoder is to keep this reconstruction
error and the associated compression loss negligible, such that the
features $z_t$ are a compact representation of the images $y_t$.
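The layer recursion and the reconstruction error in Eq. (1) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the random initialization, the helper names `init_layers` and `forward`, and the use of arctan on every layer including the decoder output are assumptions made for illustration. The layer sizes 50-25-12-6-2 are borrowed from the experimental section.

```python
import numpy as np

def init_layers(sizes, rng):
    """Random (A_k, b_k) pairs for consecutive layer sizes (illustrative only)."""
    return [(0.1 * rng.standard_normal((n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(layers, y):
    """Apply y^(k+1) = arctan(A_k y^(k) + b_k) layer by layer."""
    for A, b in layers:
        y = np.arctan(A @ y + b)
    return y

rng = np.random.default_rng(0)
enc_sizes = [50, 25, 12, 6, 2]                 # encoder g^{-1}
theta_E = init_layers(enc_sizes, rng)
theta_D = init_layers(enc_sizes[::-1], rng)    # decoder g mirrors the encoder

y_t = rng.uniform(0.0, 1.0, size=50)           # stand-in for a (PCA-reduced) image
z_t = forward(theta_E, y_t)                    # feature z_t = g^{-1}(y_t; theta_E)
y_hat = forward(theta_D, z_t)                  # reconstruction g(z_t; theta_D)
eps_R = y_t - y_hat                            # reconstruction error, Eq. (1)
```

With untrained weights the reconstruction error is of course large; training drives `eps_R` toward zero.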

2.2. Prediction Model 

We now turn the static auto-encoder into a dynamical model that can predict
future features $\hat{z}_{t+1}$ and images $\hat{y}_{t+1}$. The encoder
$g^{-1}$ allows us to map high-dimensional observations $y_t$ onto
low-dimensional features $z_t$. For predicting we assume that future
features $\hat{z}_{t+1|h_n}$ depend on an $n$-step history $h_n$ of past
features and control inputs, i.e.,

    $\hat{z}_{t+1|h_n}(\theta_P) = f(z_t, u_t, \dots, z_{t-n+1}, u_{t-n+1}; \theta_P),$    (2)

where $f$ is a nonlinear transition function, in our case a feed-forward
neural network, and $\theta_P$ are the corresponding model parameters. This
is a nonlinear autoregressive exogenous model (NARX) (Ljung, 1999). The
predictive performance of the model will be important for model predictive
control (see Section 3) and for model learning based on the prediction
error (Ljung, 1999).

To predict future observations $\hat{y}_{t+1|h_n}$ we exploit the decoder,
such that $\hat{y}_{t+1|h_n} = g(\hat{z}_{t+1|h_n}; \theta_D)$. The deep
decoder $g$ maps features $z$ to high-dimensional observations $y$
parameterized by $\theta_D$.
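A minimal sketch of the NARX feature prediction in Eq. (2) for a history of $n = 2$ is given below. The helper names `narx_input` and `transition`, the random weights, the single hidden layer of width 4, and the linear output layer are illustrative assumptions and are not taken from the text.

```python
import numpy as np

def narx_input(z_hist, u_hist):
    """Stack [z_t, u_t, z_{t-1}, u_{t-1}, ...]; histories are ordered newest first."""
    return np.concatenate([np.concatenate([z, u]) for z, u in zip(z_hist, u_hist)])

def transition(theta_P, z_hist, u_hist):
    """One-step feature prediction with a small feed-forward network f."""
    (W1, b1), (W2, b2) = theta_P
    h = np.arctan(W1 @ narx_input(z_hist, u_hist) + b1)
    return W2 @ h + b2          # linear output layer (an assumption)

rng = np.random.default_rng(1)
n, dim_z, dim_u = 2, 2, 1
d_in = n * (dim_z + dim_u)      # size of the stacked history vector
theta_P = [(0.1 * rng.standard_normal((4, d_in)), np.zeros(4)),
           (0.1 * rng.standard_normal((dim_z, 4)), np.zeros(dim_z))]

z_hist = [rng.standard_normal(dim_z) for _ in range(n)]   # z_t, z_{t-1}
u_hist = [rng.standard_normal(dim_u) for _ in range(n)]   # u_t, u_{t-1}
z_next = transition(theta_P, z_hist, u_hist)              # predicted feature
```

Passing `z_next` through the decoder would give the predicted image.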


Now, we are ready to put the pieces together: With feature prediction model
(2) and the deep auto-encoder, the DDM predicts future features and images
according to

    $z_t(\theta_E) = g^{-1}(y_t; \theta_E),$    (3a)
    $\hat{z}_{t+1|h_n}(\theta_E, \theta_P) = f(z_t, u_t, \dots, z_{t-n+1}, u_{t-n+1}; \theta_P),$
    $\hat{y}_{t+1|h_n}(\theta_E, \theta_D, \theta_P) = g(\hat{z}_{t+1|h_n}; \theta_D),$    (3b)

which is illustrated in Fig. 3. With this prediction model we define the
prediction error

    $\varepsilon^P_{t+1}(\theta_E, \theta_D, \theta_P) = y_{t+1} - \hat{y}_{t+1|h_n}(\theta_E, \theta_D, \theta_P),$    (4)

where $y_{t+1}$ is the observed image at time $t+1$.


2.3. Training 

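The DDM is parameterized by the encoder parameters $\theta_E$, the decoder
parameters $\theta_D$ and the prediction model parameters $\theta_P$. In
the DDM, we train both the prediction model and the deep auto-encoder
jointly by finding parameters
$(\hat{\theta}_E, \hat{\theta}_D, \hat{\theta}_P)$, such that

    $(\hat{\theta}_E, \hat{\theta}_D, \hat{\theta}_P) = \arg\min_{\theta_E, \theta_D, \theta_P} V_R(\theta_E, \theta_D) + V_P(\theta_E, \theta_D, \theta_P),$    (5a)
    $V_P(\theta_E, \theta_D, \theta_P) = \sum_t \|\varepsilon^P_t(\theta_E, \theta_D, \theta_P)\|^2,$    (5b)
    $V_R(\theta_E, \theta_D) = \sum_t \|\varepsilon^R_t(\theta_E, \theta_D)\|^2,$    (5c)

which minimizes the sums of squared reconstruction (1) and prediction (4)
errors.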
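The joint objective in Eq. (5) can be evaluated as in the following sketch. For brevity it uses a one-step history ($n = 1$) and hand-written toy maps in place of trained networks. The function name `joint_loss`, the toy `encode`, `decode` and predictor, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def joint_loss(encode, decode, predict, Y, U):
    """Return (V_R + V_P, V_R, V_P) for images Y (T x M) and controls U (T x F)."""
    Z = [encode(y) for y in Y]
    # Reconstruction term: squared error between each image and its reconstruction.
    V_R = sum(float(np.sum((y - decode(z)) ** 2)) for y, z in zip(Y, Z))
    # Prediction term: squared error between the next image and its prediction.
    V_P = sum(float(np.sum((Y[t + 1] - decode(predict(Z[t], U[t]))) ** 2))
              for t in range(len(Y) - 1))
    return V_R + V_P, V_R, V_P

# Toy data: 4-D "images" whose first two entries follow z_{t+1} = z_t + u_t.
rng = np.random.default_rng(0)
T = 20
U = rng.uniform(-0.1, 0.1, size=(T, 2))
Z_true = np.vstack([np.zeros(2), np.cumsum(U[:-1], axis=0)])
Y = np.hstack([Z_true, np.zeros((T, 2))])

encode = lambda y: y[:2]
decode = lambda z: np.concatenate([z, np.zeros(2)])
total, V_R, V_P = joint_loss(encode, decode, lambda z, u: z + u, Y, U)
_, V_R_bad, V_P_bad = joint_loss(encode, decode, lambda z, u: z, Y, U)
```

The second call keeps the same auto-encoder but ignores the control input in the predictor. The reconstruction term stays at zero while the prediction term grows, which is the situation that a reconstruction-only objective cannot detect.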

We learn all model parameters $\theta_E$, $\theta_D$, $\theta_P$ jointly by
solving (5a).¹ The required gradients with respect to the parameters are
computed efficiently by back-propagation, and the cost function is
minimized by the BFGS algorithm (Nocedal & Wright, 2006). Note that in (5a)
it is crucial to include not only the prediction error $V_P$, but also the
reconstruction error $V_R$. Without this term the multi-step ahead
prediction performance will decrease because predicted features are not
consistent with features achieved from the encoder. Since we consider a
control problem in this paper, multi-step ahead predictive performance is
crucial.

¹ Normally when features are used for learning dynamical models, they are
first extracted from the data in a pre-processing step by minimizing (5c)
with respect to the auto-encoder parameters $\theta_E$, $\theta_D$. In a
second step, the prediction model parameters $\theta_P$ are estimated based
on these features by minimizing (5b) conditioned on the estimated
$\theta_E$ and $\theta_D$. In our experience, a problem with this approach
is that the learned features might have a small reconstruction error, but
this representation will not be ideal for learning a transition model. The
supplementary material discusses this in more detail.


Initialization. With a linear activation function the auto-encoder and PCA
are identical (Bourlard & Kamp, 1988), which we exploit to initialize the
parameters of the auto-encoder: The auto-encoder network is unfolded, each
pair of layers in the encoder and the decoder is combined, and the
corresponding PCA solution is computed for each of these pairs. We start
with high-dimensional image data at the top layer and use the principal
components from that pair of layers as input to the next pair of layers.
Thereby, we recursively compute a good initialization for all parameters of
the auto-encoder. Similar pre-training routines are found in (Hinton &
Salakhutdinov, 2006), in which a restricted Boltzmann machine is used
instead of PCA.
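The recursive PCA initialization can be sketched as follows for the purely linear case. How the PCA solution is transferred to layers with an arctan activation is not specified in the text, so the sketch stops at the linear encoder and decoder weights. The function names `pca_pair` and `pca_init` and the random data are assumptions made for illustration.

```python
import numpy as np

def pca_pair(X, k):
    """Rank-k PCA of the rows of X: returns (encoder W, decoder W^T, mean)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:k]                       # k x d matrix with orthonormal rows
    return W, W.T, mean

def pca_init(X, sizes):
    """Initialise each encoder/decoder layer pair from PCA, outermost pair first."""
    params, H = [], X
    for k in sizes[1:]:
        W_enc, W_dec, mean = pca_pair(H, k)
        params.append((W_enc, W_dec, mean))
        H = (H - mean) @ W_enc.T     # principal components feed the next pair
    return params, H

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))   # stand-in for 200 image vectors
params, Z = pca_init(X, sizes=[10, 6, 3, 2])
```

Each tuple in `params` holds the linear weights for one encoder layer and its mirrored decoder layer, and `Z` contains the resulting two-dimensional features.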


In this section, we have presented a DDM that facilitates fast predictions
of high-dimensional observations via a low-dimensional embedded time
series. The property of fast predictions will be exploited by the online
feedback control strategy presented in the following. More details on the
proposed model are given in (Wahlström et al., 2015).


3. Learning Closed-Loop Policies from Images

We use the DDM for learning a closed-loop policy by means of nonlinear
model predictive control (MPC). We start off by an introduction to
classical MPC, before moving on to MPC on images in Section 3.1. MPC finds
an optimal sequence of control signals that minimizes a $K$-step loss
function, where $K$ is typically smaller than the full horizon. In general,
MPC relies on (a) a reference trajectory
$x_{\mathrm{ref}} = x_1^*, \dots, x_K^*$ (which can be a constant reference
signal) and (b) a dynamics model

    $x_{t+1} = f(x_t, u_t),$    (6)

which, assuming that the current state is denoted by $x_0$, can be used to
compute/predict a state trajectory $\hat{x}_1, \dots, \hat{x}_K$ for a
given sequence $u_0, \dots, u_{K-1}$ of control signals. Using the dynamics
model MPC determines an optimal (open-loop) control sequence
$u_0^*, \dots, u_{K-1}^*$, such that the predicted trajectory
$\hat{x}_1, \dots, \hat{x}_K$ gets as close to the reference trajectory
$x_{\mathrm{ref}}$ as possible, such that

    $u_0^*, \dots, u_{K-1}^* \in \arg\min_{u_{0:K-1}} \sum_{t=0}^{K-1} \|\hat{x}_t - x_t^*\|^2 + \lambda \|u_t\|^2,$    (7)

where $\|\hat{x}_t - x_t^*\|^2$ is a cost associated with the deviation of
the predicted state trajectory $\hat{x}_{0:K-1}$ from the reference
trajectory $x_{\mathrm{ref}}$, and $\lambda \|u_t\|^2$ penalizes the
amplitude of the control signals. Note that the predicted $\hat{x}_t$
depends on all previous control inputs $u_{0:t-1}$. When the control
sequence $u_0^*, \dots, u_{K-1}^*$ is determined, the first control $u_0^*$
is applied to the system. After observing the next state, MPC repeats the
entire optimization and turns the overall policy into a closed-loop
(feedback) control strategy.
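The receding-horizon principle can be illustrated on a scalar linear system. The sketch below is a simplification under stated assumptions: the toy dynamics $x_{t+1} = a x_t + b u_t$, the function name `mpc_step`, and all numerical values except $K = 15$ and $\lambda = 0.01$ are hypothetical. Because the toy model is linear, the quadratic cost is minimized with a closed-form regularized least-squares solve rather than with BFGS, and the cost is summed over the predicted states $\hat{x}_1, \dots, \hat{x}_K$.

```python
import numpy as np

def mpc_step(x0, x_ref, a, b, K, lam):
    """Open-loop optimal controls u_0..u_{K-1} for x_{t+1} = a x_t + b u_t."""
    # G maps the control sequence to the forced part of x_1..x_K.
    G = np.zeros((K, K))
    for t in range(1, K + 1):
        for j in range(t):
            G[t - 1, j] = a ** (t - 1 - j) * b
    # Free response: the trajectory that results from zero control.
    free = np.array([a ** t * x0 for t in range(1, K + 1)])
    # Minimise ||G u + free - x_ref||^2 + lam ||u||^2 in closed form.
    return np.linalg.solve(G.T @ G + lam * np.eye(K), G.T @ (x_ref - free))

a, b, K, lam = 1.0, 0.5, 15, 0.01
x, x_ref = 0.0, 1.0
for _ in range(30):
    u = mpc_step(x, x_ref, a, b, K, lam)
    x = a * x + b * u[0]     # apply only the first control, then re-plan
```

Only `u[0]` is ever applied; re-solving from each newly observed state is what turns the open-loop plan into feedback control.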

3.1. MPC on Images 

We now turn the classical MPC procedure into MPC on images by exploiting
some convenient properties of the DDM. The DDM allows us to predict
features $\hat{z}_1, \dots, \hat{z}_K$ based on a sequence of controls
$u_0, \dots, u_{K-1}$. By comparing (6) with (2), we define the state $x_0$
as the present and past $n-1$ features and the past $n-1$ control inputs,
such that

    $x_0 = [z_0, \dots, z_{-n+1}, u_{-1}, \dots, u_{-n+1}].$    (8)

The DDM computes the present and past features with the encoder
$z_t = g^{-1}(y_t; \theta_E)$, such that $x_0$ is known at the current
time, which matches the MPC requirement. Our objective is to control the
system towards a desired reference image frame $y_{\mathrm{ref}}$. This
reference frame $y_{\mathrm{ref}}$ can also be encoded to a corresponding
reference feature $z_{\mathrm{ref}} = g^{-1}(y_{\mathrm{ref}}; \theta_E)$,
which results in the MPC objective

    $u_0^*, \dots, u_{K-1}^* \in \arg\min_{u_{0:K-1}} \sum_{t=0}^{K-1} \|\hat{z}_t - z_{\mathrm{ref}}\|^2 + \lambda \|u_t\|^2,$    (9)

where $x_0$, defined in (8), is the current state. The gradients of the
cost function (9) with respect to the control signals $u_0, \dots, u_{K-1}$
are computed in closed form, and we use BFGS to find the optimal sequence
of control signals. Note that the objective function depends on
$u_0, \dots, u_{K-1}$ not only via the control penalty $\lambda \|u_t\|^2$
but also via the feature predictions $\hat{z}_{1:K-1}$ of the DDM via (2).
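Evaluating the objective in Eq. (9) for a candidate control sequence amounts to rolling the transition model forward from the state in Eq. (8). The sketch below does this for $n = 2$. The function name `mpc_cost` and the toy transition `f_toy` are hypothetical, and the treatment of $\hat{z}_0$ as the encoded current feature $z_0$ is an interpretation of the summation index.

```python
import numpy as np

def mpc_cost(u_seq, z_hist, u_past, z_ref, f, lam):
    """Cost of Eq. (9) for n = 2, with z_hist = [z_0, z_{-1}] and u_past = u_{-1}."""
    z_now, z_prev, u_prev = z_hist[0], z_hist[1], u_past
    cost = 0.0
    for u in u_seq:
        # Stage cost: feature deviation from the reference plus control penalty.
        cost += float(np.sum((z_now - z_ref) ** 2) + lam * np.sum(u ** 2))
        # Predict the next feature and shift the history by one step.
        z_next = f(z_now, u, z_prev, u_prev)
        z_now, z_prev, u_prev = z_next, z_now, u
    return cost

# Toy transition that ignores the older history (illustration only).
f_toy = lambda z, u, z_prev, u_prev: z + 0.1 * u

z_hist = [np.zeros(2), np.zeros(2)]
u_seq = np.array([[1.0, 0.0], [0.0, 0.0]])
cost = mpc_cost(u_seq, z_hist, np.zeros(2), np.zeros(2), f_toy, lam=0.5)
```

An optimizer such as BFGS would then minimize this function over `u_seq`; the text states that the gradients are available in closed form, which the sketch does not implement.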

Overall, we now have an online MPC algorithm that, given 
a trained DDM, works indirectly on images by exploiting 
their feature representation. In the following, we will now 
turn this into an iterative algorithm that learns predictive 
models from images and good controllers from scratch. 

3.2. Adaptive MPC for Learning from Scratch 

We will now describe how (adaptive) MPC can be used together with our DDM
to address the pixels to torques problem and to learn from scratch. At the
core of our MPC formulation lies the DDM, which is used to predict future
states (8) from a sequence of control inputs. The quality of the MPC
controller is inherently bound to the prediction quality of the dynamical
model, which is typical in model-based RL (Schneider, 1997; Schaal, 1997;
Deisenroth et al., 2015).

To learn models and controllers from scratch, we apply a control scheme
that allows us to update the DDM as new data arrives. In particular, we use
the MPC controller in an adaptive fashion to gradually improve the model by
collected data in the feedback loop without any specific prior knowledge of
the system at hand. Data collection is performed in closed-loop (online
MPC), and it is divided into multiple sequential trials. After each trial,
we add the data of the most recent trajectory to the data set, and the
model is re-trained using all data that has been collected so far.


Algorithm 1 Adaptive MPC in feature space

  Follow a random control strategy and record data
  loop
    Update DDM with all data collected so far
    for t = 0 to N − 1 do
      Get state x_t via auto-encoder
      u_t^* ← ε-greedy MPC policy using DDM prediction
      Apply u_t^* and record data
    end for
  end loop
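The control flow of Algorithm 1 can be made concrete with a deliberately simplified, runnable sketch. Every component is a stand-in and none of it reproduces the authors' system: the plant is a hypothetical scalar linear system observed directly (no images, no auto-encoder), "update DDM" is replaced by a least-squares fit, and the MPC policy is replaced by a one-step look-ahead controller. Only the loop structure, the ε-greedy exploration with ε = 0.2 and the control penalty 0.01 mirror the text.

```python
import numpy as np

rng = np.random.default_rng(0)
A_TRUE, B_TRUE = 0.9, 0.5            # hidden toy plant, unknown to the learner
x_ref, lam, eps, N = 1.0, 0.01, 0.2, 25

def fit_model(data):
    """Stand-in for 'update DDM': least-squares fit of x' = a x + b u."""
    X = np.array([(x, u) for x, u, _ in data])
    y = np.array([x_next for _, _, x_next in data])
    return np.linalg.lstsq(X, y, rcond=None)[0]

def mpc_policy(x, a, b):
    """Stand-in for MPC: minimiser of (a x + b u - x_ref)^2 + lam u^2 over u."""
    return b * (x_ref - a * x) / (b ** 2 + lam)

def run_trial(policy, explore):
    """Run N steps; with probability `explore` take a random action instead."""
    x, data = 0.0, []
    for _ in range(N):
        u = rng.uniform(-1, 1) if rng.random() < explore else policy(x)
        x_next = A_TRUE * x + B_TRUE * u
        data.append((x, u, x_next))   # record data
        x = x_next
    return data, x

data, _ = run_trial(lambda x: 0.0, explore=1.0)       # initial random trial
for trial in range(5):
    a, b = fit_model(data)                            # update model with all data
    new_data, _ = run_trial(lambda x: mpc_policy(x, a, b), explore=eps)
    data += new_data
a, b = fit_model(data)
_, x_final = run_trial(lambda x: mpc_policy(x, a, b), explore=0.0)  # greedy check
```

The final greedy run corresponds to evaluating the learned controller with ε = 0, as done in the experiments.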


Simply applying the MPC controller based on a randomly initialized model
would make the closed-loop system very likely to converge to a point that
is far away from the desired reference value, due to the poor model that
cannot extrapolate well to unseen states. This would in turn imply that no
data is collected in unexplored regions, including the region that we
actually are interested in. There are two solutions to this problem: Either
we use a probabilistic dynamics model as suggested in (Schneider, 1997;
Deisenroth et al., 2015) to explicitly account for model uncertainty and
the implied natural exploration or we follow an explicit exploration
strategy to ensure proper excitation of the system. In this paper, we
follow the latter approach. In particular, we choose an ε-greedy
exploration strategy where the optimal feedback $u_0^*$ at each time step
is selected with a probability $1 - \varepsilon$, and a random action is
selected with probability $\varepsilon$.

Algorithm 1 summarizes our adaptive online MPC scheme. We initialize the
DDM with a random trial. We use the learned DDM to find an ε-greedy policy
using predicted features within MPC. This happens online. The collected
data is added to the data set and the DDM is updated after each trial.
















From Pixels to Torques: Policy Learning with Deep Dynamical Models 


[Figure 4 panels: true video frames $y_{t+0}, \dots, y_{t+8}$ (top) and
predicted video frames $\hat{y}_{t+0|t}, \dots, \hat{y}_{t+8|t}$ (bottom).]

Figure 4. Long-term (up to eight steps) predictive performance of the DDM:
True (upper plot) and predicted (lower plot) video frames on test data.

4. Experimental Results 




In the following, we empirically assess the components of 
our proposed methodology for autonomous learning from 
high-dimensional synthetic image data: (a) the quality of 
the learned DDM and (b) the overall learning framework. 

In both cases, we consider a sequence of images (51 × 51 = 2601 pixels) and
a control input associated with these images. Each pixel $y_t^{(i)}$ is a
component of the measurement $y_t \in \mathbb{R}^{2601}$ and assumes a
continuous gray-value in the interval [0, 1]. No access to the underlying
dynamics or the state (angle $\varphi$ and angular velocity
$\dot{\varphi}$) was available, i.e., we are dealing with a
high-dimensional continuous state space. The challenge was to learn (a) a
good dynamics model and (b) a good controller from pixel information only.
We used a sampling period of 0.2 s and a time horizon of 25 s, which
corresponds to 100 frames per trial.

The input dimension has been reduced to $\dim(y_t) = 50$ prior to model
learning using PCA. With these 50-dimensional inputs, a four-layer
auto-encoder network was used with dimension 50-25-12-6-2, such that the
features were of dimension $\dim(z_t) = 2$, which is optimal to model the
periodic angle of the pendulum. The order of the dynamics was selected to
be $n = 2$ (i.e., we consider two consecutive image frames) to capture
velocity information, such that
$\hat{z}_{t+1} = f(z_t, u_t, z_{t-1}, u_{t-1})$. For the prediction model
$f$ we used a feedforward neural network with a 6-4-2 architecture. Note
that the dimension of the first layer is given by
$n(\dim(z_t) + \dim(u_t)) = 2(2 + 1) = 6$.
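The dimension bookkeeping above can be checked with a few lines of arithmetic. The parameter counts assume fully connected layers with one bias per unit, which the text does not state explicitly, and the helper name `n_params` is illustrative.

```python
enc_layers = [50, 25, 12, 6, 2]                 # auto-encoder 50-25-12-6-2
n, dim_z, dim_u = 2, enc_layers[-1], 1          # history length, feature and control size
trans_layers = [n * (dim_z + dim_u), 4, 2]      # prediction model 6-4-2

def n_params(sizes):
    """Weights plus biases of a fully connected network with these layer sizes."""
    return sum(i * o + o for i, o in zip(sizes[:-1], sizes[1:]))

n_pixels = 51 * 51
enc_params = n_params(enc_layers)               # encoder g^{-1}
dec_params = n_params(enc_layers[::-1])         # decoder g (mirrored layers)
trans_params = n_params(trans_layers)           # transition function f
```

Under these assumptions the whole DDM has only a few thousand free parameters, since PCA has already reduced the 2601 pixels to 50 inputs.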


[Figure 5 panels: (a) Auto-encoder and prediction model; (b) Only
auto-encoder.]

Figure 5. Feature space for both joint (a) and sequential training (b) of
auto-encoder and prediction model. The feature space is divided into grid
points. For each grid point the decoded high-dimensional image is displayed
and the feature values for the training data (red) and validation data
(yellow) are overlain. For the joint training the feature values reside on
a two-dimensional manifold that corresponds to the two-dimensional position
of the tile. For the separate training the feature values are scattered
without structure.


4.1. Learning Predictive Models from Pixels 

To assess the predictive performance of the DDM, we took 601 screenshots of
a moving tile, see Fig. 4. The control inputs are the (random) increments
in position in horizontal and vertical directions.

We evaluate the performance of the learned DDM in terms of long-term
predictions, which play a central role in MPC for autonomous learning.
Long-term predictions are obtained by concatenating multiple 1-step ahead
predictions.

The performance of the DDM is illustrated in Fig. 4 on a test data set. The
top row shows the ground truth images and the bottom row shows the DDM’s
long-term predictions. The model predicts future frames of the tile with
high accuracy both for 1-step ahead and multiple steps ahead.
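Concatenating one-step predictions means feeding each predicted feature back into the transition model as if it had been observed. A minimal sketch for $n = 2$ follows. The function name `rollout`, the toy transition in which the feature is the tile position and the control is its increment, and the identity decoder are assumptions made for illustration.

```python
import numpy as np

def rollout(f, decode, z_t, z_tm1, u_tm1, u_future):
    """Multi-step prediction by chaining one-step predictions (history n = 2)."""
    preds = []
    for u in u_future:
        z_next = f(z_t, u, z_tm1, u_tm1)      # one-step feature prediction
        preds.append(decode(z_next))          # map the feature back to an "image"
        z_t, z_tm1, u_tm1 = z_next, z_t, u    # prediction becomes the new history
    return np.array(preds)

# Toy model: the feature is the tile position and controls are increments.
f_tile = lambda z, u, z_prev, u_prev: z + u
identity = lambda z: z

u_future = np.tile([0.1, 0.0], (8, 1))        # eight steps to the right
preds = rollout(f_tile, identity, np.zeros(2), np.zeros(2), np.zeros(2), u_future)
```

Because every step builds on the previous prediction, model errors accumulate over the horizon, which is why multi-step accuracy is the relevant measure for control.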


In Fig. 5(a) the feature representation of the data is displayed. The
features reside on a two-dimensional manifold that encodes the
two-dimensional position of the moving tile. By inspecting the decoded
images we can see that each corner of the manifold corresponds to a corner
position of the tile. Due to this structure a relatively simple prediction
model is sufficient to describe the dynamics. Had the auto-encoder and the
prediction model been learned sequentially (first training the
auto-encoder, and then training the prediction model based on these feature
values), such a structure would not have been enforced. In Fig. 5(b) the
corresponding feature representation is displayed where only the
auto-encoder has been trained. Clearly, these features do not exhibit such
a structure.

Figure 6. The feature space $z \in [-1, 1] \times [-1, 1]$ is divided into
9 × 9 grid points for illustration purposes. For each grid point the
decoded high-dimensional image is displayed. Green: Feature values that
correspond to collected experience in previous trials. Cyan: Feature value
that corresponds to the current time step. Red: Desired reference value.
Yellow: 15-steps-ahead prediction after optimizing for the optimal control
inputs.

4.2. Closed-Loop Policy Learning from Pixels 

In this section, we report results on learning a policy that moves a
pendulum (1-link robot arm with length 1 m, weight 1 kg and friction
coefficient 1 Nsm/rad) from a start position $\varphi = 0$ to a target
position $\varphi = \pm\pi$. The reference signal was the screenshot of the
pendulum in the target position. For the MPC controller, we used a planning
horizon of $K = 15$ steps and a control penalty $\lambda = 0.01$. For the
ε-greedy exploration strategy we used $\varepsilon = 0.2$. We conducted 50
independent experiments with different random initializations. The learning
algorithm was run for 15 trials (plus an initial random trial). After each
trial, we retrained the DDM using all collected data so far, where we also
include the reference image while learning the auto-encoder.

[Figure 7 panels include: 1st trial, 4th trial, 7th trial.]

Figure 7. Control performance after 1st to 15th trial evaluated with
$\varepsilon = 0$ for 16 different experiments. The objective was to reach
an angle of $\pm\pi$.

Fig. 6 displays the decoded images corresponding to learned latent
representations in $[-1, 1]^2$. The learned feature values of the training
data (green) line up in a circular shape, such that a relatively simple
prediction model is sufficient to describe the dynamics. Had we not
optimized for both the prediction error and the reconstruction error, such
an advantageous structure of the feature values would not have been
obtained. The DDM extracts features that can also model the dynamic
behavior compactly. The figure also shows the predictions produced by the
MPC controller (yellow), starting from the current time step (cyan) and
targeting the reference feature (red) where the pendulum is in the target
position.

To assess the controller performance after each trial, we applied a greedy
policy ($\varepsilon = 0$). In Fig. 7, angle trajectories for 15 of the 50
experiments at different learning stages are displayed. In the first trial,
the controller managed only in a few cases to drive the pendulum toward the
reference value $\pm\pi$. The control performance increased gradually with
the number of trials, and after the 15th trial, the controller managed in
most cases to bring the pendulum to an upright position.


To assess the data efficiency of our approach, we compared it with the
PILCO RL framework (Deisenroth et al., 2015) for learning closed-loop
control policies for the pendulum task above. PILCO is a current
state-of-the-art model-based RL algorithm for data-efficient learning of
control policies in continuous state-control spaces. Using collected data
PILCO learns a probabilistic model of the system dynamics, implemented as a
Gaussian process (GP) (Rasmussen & Williams, 2006). Subsequently, this
model is used to compute a distribution over trajectories and the
corresponding expected cost, which is used for gradient-based optimization
of the controller parameters.

Figure 8. Average learning success with standard errors. Blue: PILCO
ground-truth RL baseline using the true state $(\varphi, \dot{\varphi})$.
Red: PILCO with learned auto-encoder features from image pixels. Cyan:
PILCO on 20D feature determined by PCA. Black: Our proposed MPC solution
using the DDM.

Although PILCO uses data very efficiently, its computational demand makes
its direct application impractical for many data points or high-dimensional
(≥ 20 D) problems, such that we had to make suitable adjustments to apply
PILCO to the pixels-to-torques problem. In particular, we performed the
following experiments: (1) PILCO applied to 20D PCA features, (2) PILCO
applied to 2D features learned by deep auto-encoders, (3) An optimal
baseline where we applied PILCO to the standard RL setting with access to
the “true” state $(\varphi, \dot{\varphi})$ (Deisenroth et al., 2015).

Fig. 8 displays the average success rate of PILCO (including standard
error) and our proposed method using deep dynamical models together with a
tailored MPC (DDM+MPC). We define “success” if the pendulum’s angle is
stabilized within 10° around the target state.² The baseline (PILCO trained
on the ground-truth 2D state $(\varphi, \dot{\varphi})$) is shown in blue
and solves the task very quickly. The graph shows that our proposed
algorithm (black), which learns torques directly from pixels, is not too
far behind the ground-truth RL solution, achieving an almost 90% success
rate after 15 trials (1500 image frames). However, PILCO trained on the 2D
auto-encoder features (red) and 20D PCA features (cyan) fails consistently
in all experiments. We explain PILCO’s failure by the fact that we trained
the auto-encoder and the transition dynamics in feature space separately.
The auto-encoder finds good features that minimize the reconstruction
error. However, these features are not good for modeling the dynamic
behavior of the system and lead to bad long-term predictions.

² Since we consider a continuous setting, we have to define a target
region.
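A success test of this kind has to treat $+\pi$ and $-\pi$ as the same target, so the angle difference must be wrapped. The sketch below shows one way to do this. The function names, the interpretation of "stabilized" as "the last `last_k` angles all lie inside the target region", and the value of `last_k` are assumptions, since the text does not define them.

```python
import numpy as np

def angle_error(phi, target=np.pi):
    """Smallest absolute difference between phi and target, in [0, pi]."""
    d = (phi - target + np.pi) % (2 * np.pi) - np.pi
    return abs(d)

def is_success(phi_traj, tol_deg=10.0, last_k=10):
    """True if the final last_k angles all lie within tol_deg of +/- pi."""
    tol = np.deg2rad(tol_deg)
    return all(angle_error(p) <= tol for p in phi_traj[-last_k:])

swing_up = [0.0] * 5 + [np.pi] * 10             # reaches and holds the target
drifting = [np.pi] * 9 + [np.pi - 0.5]          # leaves the target region at the end
results = (is_success(swing_up), is_success(drifting))
```

Averaging such a Boolean outcome over the independent experiments would give a success rate of the kind plotted in Fig. 8.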

Computation times of PILCO and our method are vastly different: While PILCO
spends most time optimizing policy parameters, our model spends most of the
time on learning the DDM. Computing the optimal nonparametric MPC policy
happens online and does not require significant computational overhead. To
put this into context, PILCO required a few days of learning time for 10
trials (in a 20D feature space). In a 2D feature space, running PILCO for
10 trials and 1000 data points requires about 10 hours.

Overall, our DDM+MPC approach to learning closed-loop policies from
high-dimensional observations exploits the learned Deep Dynamical Model to
learn good policies fairly data efficiently.

5. Conclusion 

We have proposed a data-efficient model-based RL algorithm that learns
closed-loop policies in continuous state and action spaces directly from
pixel information. The key components of our solution are (1) a deep
dynamical model (DDM) that is used for long-term predictions in a compact
feature space and (2) an MPC controller that uses the predictions of the
DDM to determine optimal actions on the fly without the need for value
function estimation. For the success of this RL algorithm it is crucial
that the DDM learns the feature mapping and the predictive model in feature
space jointly to capture dynamic behavior for high-quality long-term
predictions. Compared to state-of-the-art RL our algorithm learns fairly
quickly, scales to high-dimensional state spaces and facilitates learning
from pixels to torques.